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43 Abstract 
-t— > 

S Many models for sparse regression typically assume that the covariates are known 
i completely, and without noise. Particularly in high-dimensional applications, this 

is often not the case. This paper develops efficient OMP-like algorithms to deal 
t-H with precisely this setting. Our algorithms are as efficient as OMP, and improve 

on the best-known results for missing and noisy data in regression, both in the 
high-dimensional setting where we seek to recover a sparse vector from only a 
few measurements, and in the classical low-dimensional setting where we recover 
an unstructured regressor. In the high-dimensional setting, our support-recovery 
^— . algorithm requires no knowledge of even the statistics of the noise. Along the way, 

\Q we also obtain improved performance guarantees for OMP for the standard sparse 

regression problem with Gaussian noise. 

CM 

1 Introduction 

Sparse Linear Regression, also popularly known as compressed sensing, deals with the problem of 
recovering a sparse vector from linear projections. These projections typically represent a physical 
measurement process, and as such, are often subject to noisy, missing or corrupted data. Standard 
algorithms, including popular approaches such as I 1 -penalized regression, known as LASSO, are 
not equipped to deal with incomplete or noisy measurements. Not surprisingly, blindly running 
such algorithms on corrupted data gives solutions of significantly compromised quality, and indeed, 
we provide some computational results corroborating precisely this natural expectation. 

Recently, attention has turned to large-scale settings, where the data sets can be very large, and high- 
dimensional. Significantly, data collection in these settings can be even further prone to missing, 
noisy or corrupted data. In the high-dimensional setting, in particular, problems of prediction and 
inference critically hinge on correct identification of a low-dimensional structure - sparsity, in the 
regression setting. Thus, algorithms that provably provide correct subset recovery are important, in 
addition to providing guarantees on ^ 2 -error. Finally, the push towards large-scale learning calls for 
stable, simple and computationally efficient algorithms. 

This paper focuses on precisely this problem. We present simple algorithms, no more complicated 
than the fastest algorithms run on clean (meaning, no additional noise, and no missing variables) 
covariates, whose statistical performance is as good as or better than any method known to us for 
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noisy regression in the high dimensional setting, and in the low-dimensional setting. Indeed, the 
two main settings we consider are the regime where we have more observations than dimensions, 
but the signal (the regressor) exhibits no special structure such as sparsity (the "classical" or "low- 
dimensional" regime), and then the high-dimensional setting where dimensionality far outnumbers 
the number of measurements, but the signal is sparse (the standard compressed sensing setup). We 
describe our setting and assumptions in detail below, as well as provide a brief summary of recent 
work done in this area. 

In the high-dimensional setting, our algorithm is a greedy (OMP-like) algorithm. We provide condi- 
tions under which this algorithm is guaranteed (with high probability) to identify the correct support 
of the regressor. Interestingly, our support recovery algorithm requires no knowledge of the statis- 
tics of the corruption, in contrast with all other work we are aware of. Once the support has been 
identified, the problem reduces to one from the classical regime, since in the typical high dimen- 
sional scaling for compressed sensing, we have sparsity k and n = klogp samples. This reduction 
to the low-dimensional setting turns out to be critical for both computational complexity, as well 
as stability and statistical performance. The central challenge in regression with noisy or otherwise 
corrupted covariates, X, is in estimating the covariance matrix X T X. This is exacerbated in the 
high-dimensional setting, where natural estimates may not even be positive semidefinite. In our 
setting, on the other hand, we estimate the support without requiring use of the covariance estimate, 
and hence avoid computational issues of non-convexity by moving directly to the low-dimensional 
setting. 



Related Work 



The problem of regression with noisy or missing covariates, in the high-dimensional regime, has 
recently attracted some attention, and several authors have considered this problem and made im- 
portant contributions. Stadler and Buhlmann [11] developed an EM algorithm to cope with missing 
data. While effective in practical examples, there does not seem to be a proof guaranteeing global 
convergence. Recent work has also considered the case of noisy covariates. Rosenbaum and Tsy- 
bakov [ 10 1 proposed a convex-programming-based algorithm, which can be viewed as a modified 
Dantzig selector for noisy covariates. While they obtain some theoretical guarantees, their error 
bounds seem to be fairly weak. Recently, Loh and Wainwright HI studied the problem of noisy and 
missing data, and proposed a projected gradient descent algorithm. They show that their algorithm 
achieves a better error bounds than previous works. The crux of their method relies on an estimator 
(in the high-dimensional setting) of the covariance X T X of the true sensing (or covariate) matrix 
X. In this sense, their work is related to work by Xu and You [ 14| who consider a similar estimator 
but for noisy-regression in the low-dimensional setting. Because this is a high-dimensional problem, 
and the true covariance X T X is rank-deficient, the natural estimator may be non-convex (not posi- 
tive semi-definite). A key contribution there is to show that despite the non-convexity, the projected 
gradient descent algorithm succeeds in finding a near-optimal solution. The non-convexity, however, 
seems to require knowledge of || i in order to maintain bounded iterates. 

More generally, the methods mentioned above are computationally more demanding than the sim- 
plest greedy methods that have proven theoretically and empirically successful in the clean-covariate 
case, i.e., when we have noiseless access to all the covariates. Orthogonal Matching Pursuit (e.g., 
lfT2l[T3l [3ll) is a greedy method that has proven remarkably effective. These methods, however, have 
not been extended to the noisy or missing covariates case. Moreover, to the best of our knowledge, 
even the clean covariate case of sparse regression does not have a complete analysis under the noisy 
setting where measurements (response variables) are received with additive Gaussian noise. Many 
papers (e.g., (4) or even more recent papers, e.g., Q) consider deterministic (I 2 - or £°°-bounded) 
noise, and then obtain results for Gaussian noise as a corollary; however these results seem to be 
weaker than required. The work in [5| considers the high-SNR case, so it is not clear that one can 
make use of the results there. The work in HI considers OMP with Gaussian noise, but to the best 
of our understanding, there seems to be a non-trivial issue with the proof. We explain this in detail 
below. 
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Contributions 



The present work develops a greedy OMP-like algorithm for the noisy case with noise: the setting 
where the measurements we receive have additive Gaussian noise, and moreover the version of 
the covariate (or sensing) matrix we get to see, has either additive noise or random erasures. Our 
algorithm is as efficient as standard OMP algorithms, and in the case of independent columns, our 
results are at least as good or better than any results available by any method known to us. While 
we conjecture that our results can be extended to the setting where the columns are not independent, 
and the sparsity is not known explicitly, we do not pursue this here. Specifically, the contributions 
of this paper are as follows: 

1. Low-dimensional regime: For the case where the number of measurements, n, exceeds the 
dimensionality of the regressor, k, we design simple estimators for both the case of noisy 
covariates, and missing covariates. Our estimators are based on either knowledge of the 
statistics of the covariate noise, knowledge of the statistics of the covariate distribution, or 
knowledge of an Instrumental Variable correlated with the covariates. For both the case of 
missing and noisy data, we provide finite sample performance guarantees that are as far as 
we know, the best availablePlln the case of Instrumental Variables, we are not aware of any 
rigorous non-asymptotic results. 

Finally, we note that our results for the low-dimensional setting require no assumptions on 
the independence of columns of the covariate matrix. 

2. High-dimensional regime: Next, we consider the standard high-dimensional scaling, when 
the regressor is /c-sparse, but of dimension p, where p greatly outnumbers the available 
measurements, n. We develop an iterative OMP-like algorithm. We give conditions for ex- 
act support recovery in the missing and noisy covariate setting, and provide £ 2 error bounds 
for the regressor. For the case of independent columns, our results improve on past results, 
to the best of our knowledge, both in terms of computational speed and performance. Inter- 
estingly, in the noisy X setting, our support recovery algorithm requires no knowledge of 
the statistics of the noise; thus, our results imply that we have distributionally-robust sup- 
port recovery. As far as we know (see also our simulations in Section [6} other algorithms 
require some knowledge of the corruption statistics for support recovery. 

3. In simulations, the advantage of our algorithm seems more pronounced, both in terms of 
speed, and in terms of statistical performance. Moreover, while we provide no analysis 
for the case of correlated columns of the covariate matrix, our simulations indicate that the 
impediment is in the analysis, as the results for our algorithm seem very promising. 

4. Finally, as a corollary to our results above, setting the covariate-noise-level to zero, we 
obtain bounds on the performance of OMP in the standard setting, with additive Gaussian 
noise. Our bounds are better than bounds obtained by specializing deterministic results 
(e.g., ^-bounded noise as in flU) and ignoring Gaussianity; meanwhile, while similar to 
the results in |Q], there seem to be gaps in the proof that we do not quite see how to fill. 

Paper Outline 

The outline of the paper is as follows. We first consider the low-dimensional or classical statistical 
setting in Section [3] Here, the number of samples exceeds the dimensionality of the regressor. 
While important in its own right, this is critical for an OMP-based approach to sparse regression, 
since once the greedy approach has determined the (sparse) support set, the resulting problem is 
indeed a low-dimensional regression problem. Section [4] introduces the high-dimensional setting, 
and our OMP-like algorithm. Here we state the main results for this regime. Section[5]contains the 
proofs of our main results. Section [6] illustrates the performance and advantages of our algorithm 
empirically, and compares performance to other methods. 



'Xu and You 1 14] have shown asymptotic performance guarantees for some of the estimators we use, but 
have no finite sample results. 
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2 Problem Setup 



The main focus of this paper is on the high-dimensional setting. However, an important intermedi- 
ate point is the consideration of the low-dimensional regime, since once the support of the sparse 
regressor has been identified, estimating those non-zero coefficients amounts to a low-dimensional 
problem. We thus define the setup in both settings. 

We denote our unknown regressor (or signal) as f3*. In the low-dimensional setting, we have /?* g 
R k , where as in the high dimensional setting, j3* € W is an unknown /c-sparse vector. For i = 
1, . . . , n, we obtain measurement 6 K according to the linear model 

Hi = (xf,/?*) +e,, i = 1, . . . ,n. 

Here, Xj is the covariate vector of appropriate dimension (Xj £ M fc for the low-dimensional case, 
and in W p in the high dimensional case), and e,; 6 K is additive error. 

The standard setting assumes that each covariate vector Xj is known directly, and exactly. Instead, 
here we assume we only observe a vector z$ € M. k (or R p ) which is linked to Xj via some distribution. 
We focus on two cases: 

1. Covariates with additive noise: We observe z, = Xj + Wj, where w t 6 M fc (or E p ) is the 
noise. 

2. Covariates with missing data: We consider the case where the entries of Xj are observed 
independently with probability 1 — p, and missing with probability p. In particular, we 
assume the following model: 



Thus, in matrix notation, we have 



(J w.p. p 



Xf3* 



and we get to see: (y, Z), where Z = X + W in the noisy case, and Z is the entry-erased version of 
X in the missing data case. Thus, given {z^} and {yi}, we seek to estimate the unknown vector j3*. 
A task of particular importance in the high-dimensional setting is to estimate the support of (3* . We 
use Xj, Zj and w, to denote the ith row of X, Z and W, respectively, and Xj, and Wi to denote 
the ith column. 

In this paper, we consider the case where both the covariate matrix X and the noise W and e are 
sub-Gaussian. We give the basic definitions here, as these are used throughout. 

Definition 1. Sub-Gaussian Variable: A zero-mean variable v is sub-Gaussian with parameter a v > 
if for all Aet, 

E[exp(Au)] < exp(cr^A 2 /2). 

Definition 2. Sub-Gaussian Matrix: A zero-mean matrix V is called sub-Gaussian with parameter 
io- 2 ) if both of the following are satisfied: 

1. Each row g R p of V is sampled independently and has E [v,v^"] = ^5]F] 

2. For any unit vector u £ W, u T Vj is a sub-Gaussian random variable with parameter at 
most -^=a. 

Note that the second parameter ^er 2 is an upper bound. 

In Section [3] we need only a general sub-Gaussian assumption in order to guarantee our results. 
Thus we define: 

Definition 3. Sub-Gaussian Design Model: We assume X, W and e are sub-Gaussian with param- 
eters (^Ej, i), (;^iuj n a w) an d (n " 2 ; ~ a e)> respectively. We assume they are independent of 
each other. 



2 The - factor is used to simplify subsequent notation; no generality is lost. 
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For our analytical results in the high-dimensional setting, we currently require independence across 
entries for X, W and e. Thus we define: 

Definition 4. Independent Sub-Gaussian Design Model: We assume X, W and e have zero-mean, 
independent and sub-Gaussian entries. The entries of X, W, and e have parameter — , ^er^ and 
~<7g, respectively. 

Finally, we discuss our notation. In this paper, for the simplicity of exposition of the results but also 
of the analysis, we disregard constants that do not scale with k, n or p or any relevant variance a 2 . 
Thus, for example, writing n > f(k,p, a) means n > cf(k,p, a) for a positive universal constant c 
that does not scale with k, n, p, or a. 

The results we present in the sequel all hold with high probability (w.h.p.). By this, we mean with 
probability at least 1 — Cip~° 2 , for positive constants C\, C2 independent of n, p, k and a (i.e., all 
relevant variance quantities). 

3 The Low-Dimensional Problem 

We first consider the low-dimensional version of the problem where /3* € R fc , with k <C n. As noted 
above, in the high-dimensional sparse-regression setting, once we know the support of j3*, this is 
precisely the resulting problem. When k -C n, the problem is strongly convex, and in the clean- 
covariate setting where we know X exactly and completely, the solution is given by the standard 
least-square estimator: 

$ = (X T X)- 1 X T y = argmin (3 T (X T X)f3 -2y T X (3. (1) 

In this setting, well-known results establish, among other measures of closeness to /?*, the following: 

Theorem 5 ([9]). Suppose that (according to the sub-Gaussian design model defined above) X is 
sub-Gaussian with parameters (— E s , -), and the noise vector e is sub-Gaussian with parameters 

( n " 2 ' n a e)- Moreover, suppose that n > x £ ^ . Then with high probability, the estimator above 

satisfies: 

A m in(S x ) V n 

The challenge in our setting is that we know only Z (a noisy or partially deleted version of X), 
and hence cannot solve for (3. Some knowledge of X or of the nature of the corruption (W in the 
case of additive noise) is required in order to proceed. For the case of additive noise, we consider 
three models for a priori knowledge of X or of the noise. For the case of partially missing data, we 
assume we know the erasure probability (easy to check directly). For additive noise, the models we 
use are as follows. 

1 . Noise Covariance: in this case, we assume we either know or somehow can estimate the 
noise covariance, T, w — E [W^W 7 ] . We note that this is a typical assumption, e.g., ll8l [T4l . 
We also give results for when we can only conservatively estimate T, w , We are unaware of 
other such results. We discuss this further below. 

2. Covariate Covariance: in this case, we assume that we either know or somehow can es- 
timate the covariance of the true covariates, H x — E [X T X] . This assumption does not 
seem to be as common as the previous one, although it seems equally plausible to have an 
estimate of E [X T X] as of E [W T W] . 

3. Instrumental Variables: in this setting, we assume there are variables U G l nxm with 
m > k, whose rows are correlated with the rows of X, but independent of W and e, and 
that the realization of U is known or can be estimated. Instrumental variables are common 
in the econometrics literature 1012], and are often used when X is not available. To the best 
of our knowledge, no rigorous non-asymptotic results are available when one has a noisy 
or partially erased version of the covariate matrix X. 

We note in this section, our results require no assumptions on the independence of the columns of 
X, W, or, therefore, of Z; that is, we assume we operate under the sub-Gaussian design model. As 
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we discuss in more detail in Section 4] our subset selection algorithm is iterative, and empirically 
works well for correlated or independent columns, although currently our analytical results (perfor- 
mance guarantees) do require this independence assumption, and hence the guarantees hold for the 
independent sub-Gaussian model. 

Let us generically denote by £ the estimator for X T X, and by 7 our estimate for X T y. The pair 
(£, 7) depends on the assumption of what is known, i.e., according to the three possibilities outlined 
above. Thus, in place of /§ = (X J X)^ 1 X 1 y given in |T|), our proposed estimator for /3 naturally 
becomes: 



argmin/? 1 (E)/3 - 2<y 1 /?, 



where we require £ to be positive semidefinite. For this estimator, we have the following simple but 
general result. 

Theorem 6. Suppose the following strong convexity condition holds: 



A > 0. 



Then the estimation error satisfies 



< I 

~ A 



7-s/r 



Proof. Let A = /3 - /3*. By optimality of p), we have (f3* + A) T (£)(/?* + A) - 2 7 T (/3* + A) < 
/3* T (£)/3 - 2j T /3*. Rearranging terms gives A T £A < 2(7 T - /3* T £)A. Under the strong 
convexity assumption, the l.h.s. is lower-bounded by A||A|L. The r.h.s. is upper-bounded by 



7~£/3* 



I A|| 2 thanks to Cauchy-Schwarz. The result then follows. □ 



This result is simple and generic. We specialize this bound to the case of additive noise, and missing 
variables, and in particular, in the case of additive noise we specialize it according to each estimator 
we form depending on the information available (as discussed above). 



3.1 Additive Noise 

As outlined above, we have three different approaches for obtaining information on X T X, and 
hence three different estimators, depending on what information we have available. Given this 
information, the resulting estimator is quite natural. We list these here, and subsequently provide 
the results obtained by specializing the main theorem above. 

1. If £„> is known, we use £ = Z T Z — S w , 7 = Z T y. We note that this estimator has been 
previously studied in [14], but their analysis is asymptotic. Here we give finite-sample 
bounds. 

2. If Ti x is known, we use £ = T, x , 7 = Z T y. This is simple, and as our results show, in 
certain regimes its performance improves that of Z T Z — Yl w . While simple and natural, 
we were not able to find previous analysis with performance guarantees. 

3. An Instrumental Variable (IV) U £ jj nxm ( m > fc) is a matrix whose rows are correlated 
with the corresponding rows of X but independent of W. If such an IV is known, we 
use E = Z T UU T Z, 7 = Z T UU T y. While the use of instrumental variables is popu- 
lar in the economics literature, we are unaware of previous analysis with non-asymptotic 
performance guarantees. 

Remark 7. While assuming knowledge of E^ is common, and indeed a central focus of this paper, 
in some applications it may only be reasonable to assume knowledge of an upper bound on the 
noise covariance. That is, we may only be able to obtain some estimate E w such that E w y E TO . 
Our algorithms and analysis carry over in this case, providing a somewhat weaker (as expected) 
guarantee for i 2 error. 
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The results we present hold with high probability, where recall that by with high probability (w.h.p.) 
we mean with probability at least 1 — Cip~° 2 , for positive constants C\, C2 independent of n, p, k 
and a. While in the reduction from the high-dimensional setting, the parameter p has a natural inter- 
pretation as the original number of covariates (i.e., the dimension of /?*), here it is just a parameter 
that indexes the guarantees, trading off between accuracy and reliability. 



> 



Corollary 8 (Knowledge of T, w ). Suppose n ^ x - 2 ^ ^ 
estimator built using E = Z T Z — E w and 7 = Z T y, satisfies 



klogp for p > k. Then, w.h.p., the 



(3-/3* 



< 



(a w +CT 2 w ) \\/3*\\ 2 +<T e y/TT. 
Amin (Ea; ) 



k \ogp 



Remark 9. Note that when a w = 0, the bound reduces to the standard bound for the least-squares 
estimator; in particular, it implies exact recovery when a w = a e = 0. Also, compared to existing 
results in lfl4l [8L our bound makes clear the dependence on \\(3* || 2 . This is important and intuitive 
since W is multiplied by (3* . 

Remark 10. If we only have an upper bound, E m y H w , then using the same analysis one can show: 



P~P* 



< 



(a w + al) \\l3*\\ 2 + a e ^/T+rt\ + A max (E M - X w )\\/3*\\ 2 



with high probability as long as the denominator is positive. Note that, as one might expect, the 
result is not consistent. Nevertheless, it allows us to quantify precisely the value of better estimation 
of the noise covariance. 

Corollary 11 (Knowledge of S x ). Suppose n > logp. Then, w.h.p., the estimator built using 
E = Yi x and 7 = Z T y, satisfies 



13-/3* 



< 



(l + a w )\\l3*\\ 2 + a e ^/l + a l Jklogp 

^min (E^ ) 



Remark 12. (1) We only require n > logp (the case where we use E m for our estimator requires the 
much more restrictive n > (1 + <r^) 2 fc logp). The reason for this, is that here we don't estimate E 
from data, and hence do not require f2(fclogp) samples in order to control A m j n (E) by A m i n (E x ), 
as is required in the previous result. (2) The bound is linear, rather than quadratic, in a w (when a w 
is large), but it does not vanish when a w and a e are zero. 

Remark 13. The projected gradient method in Loh and Wainwright (SI can be modified to use E^ as 
the covariance estimator, and when we extend their algorithm in the natural way, a similar analysis 
yields the same error bound. 



Suppose E^ = /. Comparing the above bounds: 



Knowledge of E^: 
Knowledge of E^: 



p- 


13* 


< 






2 ~ 


p- 


13* 


< 
2 ~ 



(a w +al) \\(3*\\ 2 + a e y/l + al 
(l+a w ) \\P*\\ 2 + v e y/l + <T* 



k logp 



k logp 



we see that the only difference is er^ vs. 1. The cr^ term arises from (W T W — E w ) while the 
1-term comes from (X T X — E^). The first bound is better when cr 2 < 1, and the other way around 
when d 2 w > \. This suggests the following strategy: if we somehow know (or can estimate) both the 
variance of X and W, then we should use the first estimator if erj, < 1, otherwise use the second 
estimator. This gap in performance according to different regimes is in fact present, as confirmed in 
our simulations in Section [6] 

For the final result in this section, we use the following standard notation: For a matrix A, we let 
cri(A) be the i-th singular value, so, e.g., cri(A) = er max (/l). 

Corollary 14 (Instrumental Variables). Suppose the Instrumental Variable U € M. nxm is zero- 
mean sub-Gaussian with parameter (— E;/, \cr^), and E [?7 T X] = T,jjx- Let <j\ — <Ti(Eux) 
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and fifc = Ufc [Sux)- If n > max|l, ^/k)^ J ^^°SP> r ^ en w.h.p. the estimator built using 
E = Z T UU T Z, and 7 = Z T UU T y, satisfies 



Remark 15. Consider the term 



< 



<4 11/3*11 



A: logp 



(cti/ct«) 



1 



The first factor can be interpreted as 1/SNR, and the second is a measure of the correlation between 
X and U (i.e., the strength of the Instrumental Variable). 



3.2 Missing Data 

As we do for noisy data, in the missing data setting we consider the sub-Gaussian Design model, 
so that X is sub-Gaussian with possibly dependent columns. We assume that each entry of the 
covariate matrix X is missing independently of all other entries, with probability p. Again, the key 
is in designing an estimator for X T X. We consider the natural choice, given our erasure model: We 
use E = (Z T Z) M and 7 = r^zr^ Z T y, where Mij = if i = j or njpp otherwise; here 
denotes element-wise product. 

Corollary 16 (Missing Data). Suppose n > p^^p — -^-^klogp. Then, w.h.p. our estimator 
satisfies 

Remark 17. Note that as with the previous results, the dependence on ||/3* || 2 is given explicitly. 



4 The High-Dimensional Problem and Orthogonal Matching Pursuit 

We now move to the high-dimensional setting. Given the results of the previous section, the main 
challenge that remains in the high-dimensional setting is to understand precisely when we can re- 
cover the correct support of j3* . With this accomplished, we can immediately apply the results of 
the low-dimensional setting. It is this which spares us from having to compute an estimate of X T X 
in the high-dimensional setting, thus allowing us to avoid issues with non-convex optimization. 

The main contribution in this section is showing that our simple approach enjoys, as far as we know, 
the best known support recovery guarantees in the noisy and missing variable setting. 

Our algorithm is iterative, and we would expect, as our experimental results in Section[6]corroborate, 
that it performs well despite correlation in the columns of X and W. However, the line of analysis we 
develop in order to prove our performance guarantees, seems to require this independence. Hence, 
the results in this section all assume the Independent sub-Gaussian design model, i.e., the entries of 
X and W are assumed independent of each other and everything else. 

Interestingly, this section shows that estimating the support of /?*, can be done without using knowl- 
edge of E lu , E x . or an instrumental variable, but instead using Z directly. A critical advantage to 
this approach is that while we pay a price in £2 errors for not having an accurate estimate of E w (as 
in the remark of the previous section), our support recovery is distributionally robust, in the sense 
that our algorithm does not need to know E m or E x in order to recover the support (our guaran- 
tees, of course, are in terms of these quantities). Indeed, we use Z directly for deciding on the next 
element to add to the support set. We note that this is not that surprising, given that the error is as- 
sumed to be unbiased (in particular, in dependent of X and j3*) and standard OMP adds the element 
corresponding to largest correlation. Somewhat surprisingly, we also use Z directly to estimate the 
residual, rather than use the estimators of the previous section. The advantage is that this allows 
distributionally-robust (i.e., distributionally oblivious) support recovery. Without this approach, we 
would have to control the propagation of the additional error obtained from conservative estimates 
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of the noise. It seems that other algorithms in this space, e.g., [8| require knowledge of H w . We 
test the distributional robustness (i.e., the effect of not knowing the correct H w ) in our simulations 
results in Section [6] where our results show that support recovery in [8 1 deteriorates if H w is under- 
or over-estimated. 

We consider the following modified OMP algorithm (Algorithm [T}. Note that the support recovery 
step is identical to OMP, where as in the final step recovering $ is done using the estimators of the 
previous section. Given a matrix Y, we use Yj to denote the sub matrix with the columns of Y 
with indices in I, and Y] j the square submatrix of Y with row and column indices in I. We note 
that unlike some of the latest results on OMP, we do not deal with stopping rules here, but instead 
simply assume we know the sparsity (or an upper bound of it). We believe that this can be relaxed 
by adapting the by-now standard stopping rules, although we do not pursue it here. 



Algorithm 1 mod-OMP 

Input: Z,y,k 

Initialize 7 = 0, 1° = {1, 2, . . . , p}, r = 0. 

For j = 1 : k 

Compute corrected inner products hi = Zjr, for i € 

r. 

Let i* = argmaxig/c \hi\ . 
Set I = lU{i*}. 

Update residual r = y — Zj(Zj Zj)' 1 Zjy . 

End 

Set P as: 

hi = tjfa- 
/3 7 c = 0, 

where (S/,7,7/) are computed according to knowledge of 
E x , T, w or U. 
Output 0. 



4.1 Guarantees for Additive Noise 

Theorem 18. Under the Independent sub-Gaussian Design model and Additive Noise model, mod- 
OMP identifies the correct support of j3* with high probability, provided n > (1 + cr^) 2 fc logp and 
the non-zero entries of ft* is greater than 



16 (a w \\/3*\\ 2 + a e ) 

Remark 19. (1) If k is an upper bound of the actual sparsity, then mod-OMP identifies a size- 
k superset of the support of j3* . (2) Considering the clean-covariate case, a w = 0, we obtain 
results that seem to be stronger (better) than previous results for OMP with Gaussian noise and 
clean covariates. The work in 1 1 1 obtains a similar condition on the non-zeros of (3* , however, their 
proof seems to implicitly require an independence between the residual columns and the noise vector 
which does not hold, and hence we are unable to complete the argument]^] 

Once mod-OMP identifies the correct support, the problem reduces to a low-dimensional one, which 
is exactly what we discuss in the previous section. Thus, applying our low-dimensional results, we 
have the following bounds on £2 error, complementing the above results on support recovery. 

Corollary 20. Consider the Independent sub-Gaussian model and Additive Noise 
model. If n > (1 + er^) 2 fclogp and the non-zero entries of /3* are greater than 

3 More specifically, in 1 1 1, the proof of Theorem 8 applies the results of their Lemma 3 to bound ||X T (7 — 
X I (XjXi)- 1 Xl )e)||oo. As far as we can tell, however, Lemma 3 applies only to ||X T e|| 00 thanks to 
independence, which need not hold for the case in question. 
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(a w || 2 + g"e) \/l ~^ a w\J kl °n^ ~ i then with high probability, the output of mod-OMP with 
estimator built from (£,7), satisfies: 



1. (Knowledge of "E w ): ft — f3* 

2. (Knowledge of T, x ): ft — ft* 

3. (Instrumental Variables): 



< 



< 



2 



{cy w +al)\\P*\\ 2 + a e ^l^l 
(l + a tu )||/3*|| 2 +c7 eV /rT^! j 
<(a w \\ft*\\ 2 + a e )^ /•*' 



k log p 



/k/r 



4.2 Guarantees for Missing Data 



We provide a support-recovery result analogous to Theorem 18 above. 

Theorem 21. Under the Independent sub-Gaussian Design model and missing data model, mod- 
OMP identifies the correct support of ft* provided n > fjzrfii k logp and the non-zero entries of ft* 
are greater than 

— p (\W 11, + ^)^ — • 

Remark 22. There is an unsatisfactory element to this bound, as it does not recover the clean- 
covariate results as p — >• 0. Since the algorithm reduces to the standard OMP algorithm, this is a 
short falling of the analysis. At issue is a weakness in the concentration inequalities when p is small. 
We note that the leading optimization-based bounds for performance under missing data, given in 
[8 1, seem to face the same problem. 

We can combine the above theorem with Corollary [T6]to obtain a bound on the I2 error. 
Corollary 23. Consider the Independent sub-Gaussian model and Missing Data model. If n > 

(jz^ji k 1°S P an d tne non-zero entries of ft* are greater than (|| ft* || 2 + cr e ) "J^^> then using 
our estimator for missing data, the output of mod-OMP satisfies 



ft -ft* 



< 



1 



(1-P) 2 



1/3* 



1 



fc logp 



5 Proofs 



We prove all the results in this section. We make continued use of a few technical lemmas on 
concentration results. We provide the statement of these lemmas here, but postpone the proofs to the 
Appendix. 



5.1 Supporting Concentration Results 

Lemma 24. [8. Lemma 14] Suppose Y e I 
(iS, ia 2 ). Ifn > logp > log A;, then 



ix k 



is a zero mean sub-Gaussian matrix with parameter 



\Y T Y - S| 



> c a' 



logp 



< a exp (-c 2 logp) . 



Lemma 25. Suppose X e R nxk , Y e M. nxm are zero-mean sub-Gaussian matrices with parame- 
ters (iS x , icr 2 ), (iS y , icr 2 ). Then for any fixed vectors Vi, v 2 , we have 



P(\vJ (Y T X-E [Y T X]) v 2 | > * || Vi || ||v 2 ||) < 3cxp ( -c 
In particular, ifn > logp > log m V log k, we have w.h.p. 



|vj" (Y T X E [Y T X]) v 2 | < a x a y || Vl || ||v 2 || J 
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Setting vi to be the i th standard basis vector, and using a union bound over i = 1, . . . , m, we have 
w.h.p. 



{Y T X-E[Y T X])v\\ oo <a x a y \\v\ 



logp 



As a simple corollary of this lemma, we get the following. 

Corollary 26. If X £ M. nxk is a zero-mean sub-Gaussian matrix with parameter (i£ x I, ~o%), 
and v is a fixed vector in W 1 , then for any e > 1, we have 



|X T v|L > \ r 1 + e)k o x \\w\\ 2 \ <3exp(-cfce). 



are zero mean sub-Gaussian matrices with parameter 



e t 



Lemma 27. If X e R nxk , Y e M nxm 



P sup |v T (Y T X - E [Y T ^]) v 2 | > i < 2exp -cnmin( 

\vieE m ,v 2 eR fc ,||vi||=||v2||=i ' / V 

f a 2 a 2 1 

/« particular, for each A > 0, ifn > max < , 1 > (k + m) logp, then w.h.p. 

sup \vj (Y T X-E[Y T X])v 2 \ < 1a||v 1 ||||v 2 ||. 

vi6R m ,v 2 SM fc 04 

5.2 Proof of CoroUary|8] 

Proof. Substituting Z = X + W into the definition of 7 and S, we obtain 



0-2(72 ' ax(Ty 



+ 6(fc + m) 



Using Lemma [25] we have w.h.p 



= \\-X T WP* +Z T e- (W T W -E^P*^ 
< \\X T W/3*\\ + ||Z T e|| + \\{W T W -E w )f3* 



\X T W(3*\ 



< °w\\P\\ 



logp 



\\Z T e\ 

\\(W T W-Z w )p*\ 



logp 



It follows that 

1-tp* 



< Vk 



< 



< dm n ° sp 



\\P*\\ 2 +(T e y/l + <Tl 



k logp 



On the other hand, observe that Z is sub-Gaussian with parameter (i£ x + h^w> r(l + &w))- 



When n > (1 + 2 J ( ^° SP , by Lemma 27 with A = A mln (£ x ), we have Ai (Z T Z - (E x + S ro )) < 
g2A m i n (E x ) w.h.p. It follows that 

A m in IS) = inf « T Su = inf v T CE X + Z T Z — (E x + v 

V / ||v||^i Ml— 1 

> A m in(S 2; ) — Ai [Z? Z — (S x + Eu,)) 



^ 2^min(5j x ). 

The proposition then follows by applying Theorem|6] 



□ 
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5.3 Proof of Corollary[TT] 

Proof. In this case, we have 



I (X T X - E x )j3* + W T X(i* + Z T e\ 



< \\w T xp* 



\Z T e\\ + \\(X T X-Z x )f3*\ 



By Lemma [25] we have w.h.p. 



\W T Xj3* 



< <T V \\P\\ 



\ogp 



\Z T e\\ < a e Jl + a 



\(X T X-X x )f3*\\ < 



2 , l l0 &P 

w 

\ogp 



So 



7 -E/3* 



7 -E/3* 



< 



(l + a w )\\l3*\\ 2 + a e ^l+al 



k log p 



. On the other 



hand, by assumption A min (E) = \ min (E x ). The proposition then follows by applying Theorem 
□ 

5.4 Proof of Corollary[l4] 



Proof. First observe that 

A m in(E) 



A m i„((X + ^) T C/C/ T (X + V^)) 



= a 2 k {U T X + U T W) 

= al(E[U T X] +{U T X -E[U T X]) + U T W) 
> [a k - cri([/ T X-E [U T X]) -cri(C/ T VF)] 2 . 



By Lemma 27 with A = a k , we have cri(C/ T VK) < \a k and cti(£/ t X - E [U T X] ) < \o k under 



our assumption, so A m ; n (E) > |<r^. On the other hand 



s/r-7 



We bound each term. 



(X + W) T UU T (Wf3* -e) 



< \\X T UU T (W/3* + e)|| + \\W 1 UU 1 (W/T + e) 



TttttT, 



1. By Lemma 27 with A = oi, we have ai(U T X) < |ai. Each entry of Wf3* + e is i.i.d. 



zero-mean sub-Gaussian with variance bounded by crj ||/3*|| + Cg. Hence by Lemma 

2 A / m log p 



25 



t/ T (VF/3*+e) < ^m\\U T {WP* + 



follows that 



(U^X) T U^(W/3* + e) < 2 v /^||^|| 2 + ag V / ^ losp . 



2. By Lemma 27 with A = o"i, we have || W U\\ < o\ under the assumption, so the second 



□ 



term is bounded by a w \ \ f3* 1 1 J CT " CTl " losp . 
The result follows from applying Theorem|6] 
5.5 Proof of Corollary[l6] 

Proof. Let E, = E [Z T Z}- we have (E,)y = (l-p)(E x ) w fori = j and (£,) y = (l-p) 2 (E x )y 
for i 7^ j. Note that the observed matrix Z is sub-Gaussian with parameter (— E z , -), which follows 
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from the sub-Gaussianity of X (c.f. 0). We set A 2 = Z T Z — S z . By Lemma 24 we know 
rnaxj |(A Z )^| < |(1 — p) 2 A m i n (S x ) w.h.p. When this happens, for each unit norm v, we have 



< 



< 



(l-p) 2 
,v T A z v + \\v\\ 



(l-p) 2 Z (1-P) 2 
v T A z v + -rA min (S x ). 



max|(A z ) i? 



By Lemma 27 with A = 4(1 — /9) 2 A m i n (E K ), we obtain max„.||„|| =1 v T A2?; < — /9) 2 A m i n (S a 



sow T (A z 0M)w < ^(l+p)A min (S x ). Because S = (E z + Z T Z - S z ) M = E x + A 2 0M, 
it follows that A min (E) > A min (E x ) - Ax (A, M) > |A mln (S x ). 

On the other hand, observe that 



< Il7-S^*|| 



< 



1 



(S - E x )/3* 



By Lemma 



25 



w.h.p. the first term is bounded by j-jr - 



1-P 

^2 > anc [ tjjg second term is bounded 



b y^V » ■ The magnitude of the i-th term of (£ - E^)/?* is 

|p T 2-E[Z T Z]),_0M,^*| = |(Z T Z-E[Z T Z]) 4 _(Af i 1 l0/3*)| 

< ||(Z T Z-E[Z T Z])(M i T l0/3*)|| o< 

Note that we use Mj_ to denote the i* 71 row of the matrix M. 
Thus, by Lemma [25] and union bound over i, we have 



< 



< 



max \\{Z T Z-E \Z T Z])(Mj_ Q /3*)\ 

i—l,...n " 1 

J^maxllAfJ 0/TIL 
V n i z 

n 



(i-p) 2 



logp 



Combining pieces, we have 



■ Vk 



< 



1 



,(l-p) 2 

The corollary follows by applying Theorem [6] 
5.6 Proof of Theorem[l8] 



k \ogp 



□ 



We use induction. The inductive assumption is that the previous steps identify a subset / of the true 
support I* = supp(/3*). Let I r = I* — I be the remaining true support that is yet to be identified. 
We need to prove that at the current step, mod-OMP picks an index in I r , i.e., \\hj \\ > \hi\ for all 
I G (/*) c . 

We use a decoupling argument similar to lfl"3"l : consider the oracle which runs mod-OMP over only 
the true support I* . Then our mod-OMP identifies I* if and only if it identifies it in the same order 
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as the oracle. Therefore we can assume 7 to be independent of JQ and Wi for all i £ (I*) c . Note 
that 7 may still depend on Xj* , Wi* , and e. 

Define Vi = Z r (Zj Z/)" 1 Zj . We have 
\h Ir \\oo 

Zj r (I - ViHZj.fi. - Wrft. + 
Zj r (I - V^ZjJt - Wj,/3*j, + e)^ 
(a) ||X7(7 - Vi)X Ir m + Wj(I - Vi)X Ir m - Zj(I - Vi)Wjh + Zj(I - Vi)e\\ 



> 



1 



\xj r (l - V^XjJlW^WwKl - V^XjJlW^WzKl - VNWrf! - e)\\0 



y/k — i 

where (a) follows from substituting Z = X + W. For the first term, we have the following lemma. 
Lemma 28. Under the assumptions of Theorem 18 w.h.p. V7i C 7* , 7f = I* — 7i, 



Proof. By Lemma 27 
other hand, fixing 771 



xj t {i-v h )x lt ) > - 
\ ma *(wJ t {i -VjjX^) < i 

and a union bound, we have w.h.p. V7i C 7*, A m i n ^XjcXi^j > 1. On the 



I* we have 



XjoV^Xjc 



Again by Lemma 

1 - exp(cn IrF ij 

I - exp (cn^^T 
Xj(P H X n 



27 



XjcZ Il {Z\Zi^) Zj x Xq 

< °1 {XjcZIl) /cr min (Z^ZiJ . 

Cmin (Zj ZiA > 5(1 + olf) with probability at least 



+ 12k 
12k 



and af[X^Z h 



< 



with probability at least 



So a union bound over all I\ yields w.h.p. Vii C 7*, 



< |. It follows that 



(Xj^l -V^Xtc 



XjcXjc)- xj.v h x q 



27 



and the union bound, we have w.h.p. Vii C 7*, 



Similarly, by Lemma 
WjoPi^Xj, < hence A max [Wjc{I - PjJXj.) < W^Xj; 



> -. 

op 4 

WjcXjc 



Op 



< i and 

16 



< 



□ 



Therefore, the first term in (J2j> is lower bounded by j ||/3j || 2 , and the second term is upper bounded 

by|||^|| 2 . 

Now consider the third term in[2j By Lemma 27 and a union bound, we have w.h.p. o\ (Wi 1 ) < |cr w 
for all Ii. Lemma 25 gives ||e|| 2 < |cr e . It follows that ||(7 - Vi 1 )(Wi 1 f3 Il - e)|| 2 < <ti(7 - 

^J^iCW/Jll^J^ + Hellj,) < §(^11/3^2 + ^). Seti^ = (7 - Vi){Wifr - e). Be- 
cause Z/c and are independent, Corollary 26 gives Z7= < y ^ 1+€ ^ k ^ 1+a ^ {{v^ || 2 
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with probability at least 1 — 3exp (cfce 2 ). Using a union bound over all I\, we conclude that the 
third term is bounded w.h.p. by 4 y / ( 1 +0(fc- _j)iggp (a w \\fi\\ 2 + a e ) . 
Combining the above bounds, we have 
1 



> 



1 \\R* II 1 \\R* II A . ( l + (T l)( k - i ) lo &P („ 110*11 



which is greater than g ^l— ^ ||/3J || 2 if all the non-zero entries of /3* are greater than 

i6Knrii 2 +^)V (1+ i losp - 

On the other hand, by similar argument as above we have ||(7 — 'Pj)(Zj r (3j r — Wj*f3j* + e)|| 2 < 
I (ll^jJIa + Gw + °eY Note that for each i € 7* c , Z t is independent of Z r , Xj* and e. 

Applying Corollary |26]gives w.h.p. 

N = |Z i r (/-'P / )(X / ./? / . +e)| 
7 T 



Z 4 ' (7 - Vi){Z Ir p Ir - Wi.fr + e) 



< 4 y (J|Pz r || 2 + a w \\Plh + a e) , 

which is smaller than ^J^i || 2 provided n > (1 + er^ ) 2 k log p, and the nonzeros of /3* are 
greater than 4 (cr^ ||/3*|| 2 + cr e ) \J / ( 1+cr ™^ iggP _ Using a union bound shows this holds for alH e 7* c . 
We conclude that H/i/JI^ > for all i £ I* c w.h.p. This completes the proof. 

5.7 Proof of Theorem|2T] 

Proof. Note that the entries of Z are i.i.d. sub-Gaussian random variables with parameter Jl^. 
Similarly to the proof of Theorem [18] we use induction, the decoupling argument, and the same 
notation. Therefore, it suffices to show H/i^H^, > \hi\ for all i 6 (7*) c . 

We have 

IIMoo = \\Zj r (I-V I )(X I .^+e)\\ oo 

> \\Zj r (I - Vi) (XjJl + (X! - Zi)fi + e) || 2 

> (\\zj r (l - P!)X Ir ^ r \\ 2 + \\zj r (l - Vj)(X! - Zi)pl\\ 2 - \\zj r (l - V!)e 



27 



We also have 



Consider the first term. We have X m i n (Zj Xi r ) > §(1 — p) by Lemma 
X^iZjZi) > i(l - p), a^ZjZi) < pf, o x (Z~]X It ) < |(1 - p) 2 by the same lemma. 

It follows that Ai(Z£PjX Jr ) = X 1 (Zj r Z I (Zj Z^ 1 Z] ' X Ir ) < a 2 (Zj Z^/X^Zj Zj) < 
1(1 - p) 3 . We conclude that A min (Z / T r (7 - Vi)Z Ir ) > X min {Zj Z T ) - X 1 (ZjV I Z Ir ) > p). 
So the first term is at least 11/3? |L. 

4Vfc-l ll^r 112 

For the second term, we apply Lemma [27] to obtain that w.h.p., a\{Xj — Zj) < 2. It follows that 

\\{I-V I )(X I -Z I )p J \\ 2 < 2\\(3j\\ 2 . 
By Corollary [26] and a union bound, we obtain 



\zKi-PjKXj zML < 2||/j f || 2 J (fc l)logp 



which is smaller than ^§^(1 — p) ||/3| |L if the non-zeros are bigger than j^- \\0i\\ 2 
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Consider the third term. In the proof of Theorem 18 we have shown that ||e|| 2 < a e , so 
|| (/ — Vi)e\\ 2 < a e . w.h.p. By Corollary 26 and a union bound, it follows that \\Zj (I — 'P/)e|| 2 < 



(k-i) logp 



<7 e , which is smaller than || 2 if non-zeros are bigger than xz^^e y ^ 

Combining the above bounds, we conclude that H^/JI^ > 8 ^^ ||/3/ r |U if all the non-zero entries 
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of /3* is greater than 



log P 



We now consider \hi \ for i £ (I*) c . We have 



|(J-7>j)(A>.0i.+e)|| 2 < 



\Xj,Pj, +e|| 2 
|| 2 +a e . 

So by independence of Z.- L and Xj* and Corollary|26| we obtain 

I Ail - \Z?(I-Vi)(Xi.Pj.+e)\ 



< - 11/3* 
- 2 " M 



< 



logp, 3 

(o Hp 2 + <7 eJ, 

n 2 



which is smaller than 1 /i ^U IISJ II if all the non-zeros of j3* are bigger than f^r-(||/3*|| 2 



□ 



6 Numerical Simulations 

In this section we provide numerical simulations that corroborate the theoretical results presented 
above, as well as shed further light on the performance of mod-OMP for noisy and missing data. 
Our results illustrate, in particular, several key points. First, in both the low-dimensional and high- 
dimensional settings, empirical results demonstrate that the scaling promised in the corollaries to 
Theorem [6] and Theorem [18] is correct. We demonstrate this by rescaling the error of our experi- 
ments, normalizing by the predicted contribution to the error of n, k and p, in order to highlight the 
dependence on a w . Our experiments show a clear alignment of the actual results along the predicted 
results. The results of this section also show the different regimes of efficacy of our different estima- 
tors for the noisy-covariate setting. Finally, we also compare to 0, and demonstrate that in addition 
to faster running time, we seem to obtain better empirical results in terms of recovery errors. 

We present the low-dimensional results first, and then the high-dimensional results. 



6.1 The Low-Dimensional Case 

We report some simulation results on our low-dimensional results from Section [3] These results 
are also relevant to the high-dimension setting, as our OMP algorithm reduces a high-dimensional 
problem to a low-dimensional one once it identifies the correct support. Note that each of our bounds 
in Corollary 8 to Corollary 16 scales with -^p, which is to be expected. Therefore, we focus on 



verifying the scaling with the other parameters such as k,a w , p and ||/3* 

We first look at the case with additive noise. We fix n = 3200, <r e — 0, Y. x — I and £„, = cr^I, 
and sample all matrices from a Gaussian distribution, k and a w take values in 2, 3, . . . , 7 and [0, 2], 
respectively. For each k, we generate the true j3* as a random ±1 vector; note that ||/3*|| = \fk, 
which also scales with k. Figure T](a) shows the i 2 recovery error under different k and a w using 
the estimator built from knowledge of Yj w , where one can see the quadratic dependence on a w . 
Corollary ^predicts that, with fixed n, the error scales proportional to (a w + a^)\\(3* \\\/k logp — 
(a w + a~Jky/logp; in particular, if we plot the error versus the control parameter (a w + cr^)fc, all 
curves should roughly become straight lines through the origin and align with each other. Indeed, 
this is precisely what we see; the results, representing results averaged over 100 trials, are plotted in 
Figure 0(b). ' 

Similarly, we performed simulations for the estimators built from knowledge of Yj x and from Instru- 
mental Variables. In the latter case, the Instrumental Variable is generated by U = XT + E, where 
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(a) 



Control Parameter 
(b) 



Figure 1: £2 recovery error of the S w -estimator for additive noise versus (a) the noise magnitude 
<r w and (b) the control parameter (a w + u%)k. As predicted by Corollary p3] all curves in (b) are 
roughly straight lines and align. Each point is an average over 200 trials. 




Figure 2: (a) £ 2 recovery error of the E^-estimator for additive noise versus the control parameter 
(1 + <J w )k. (b) I2 recovery error of the IV-estimator versus the control parameter a w k. As predicted 
by Corollary[TT]and[T4] all curves are roughly straight lines and align. Each point is an average over 
100 trials. 



r e jj fexm w j m m = 2k and the entries of T and E being i.i.d. standard Gaussian variables; in 
this case we have <7i(Eux) ~ ^ki^ux) = Q(y/rri) and a u = y/k. Corollaries [TT| and [l4| predict 
that the £2 errors are proportional to the control parameters (1 + a w )k and a w k. respectively. These 
predictions again match well our simulation results shown in Figure|2](a) and (b). 

In addition, we compare the performance of the estimators built from and Y, x . Figure [3] shows 
their recovery error under different a w with k = 7. The results match the theory, and in particular 
show that the scaling depends as predicted on a w : The S w -estimator performs better for small a w , 
and in particular, delivers exact recovery when <j w = 0; the -estimator is more favorable for large 
a w due to its linear dependence on a w (versus quadratic), but the error does not go to zero when 
a w — > 0. The crossover occurs roughly at a w = 1. 
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Figure 3: Comparison between recovery errors of the £„,- and Y. x -estimators for additive noise. 
Each point is an average over 100 trials. 



0.8 r , 0.45 




(a) (b) 

Figure 4: l 2 recovery error for missing noise versus (a) the erasure probability p and (b) the control 
parameter p\fk. Each point is an average over 200 trials. 



Finally, we turn to the case with missing data. We perform simulations with parameters n = 2000, 



k G {2, . . . , 7}, p G 
n fixed, Corollary 16 



0, 0.8], and (3* generated in the same way as above (so that ||/3* || = k). With 
guarantees that the recovery error is bounded by 0( n~^\2 )■ The simulation 
results in Figure pTseem to outperform this bound, as the error goes to zero when p 0. If we 
plot the error versus the control parameter k^-, then the curves become roughly straight lines and 
align. It would be interesting in the future to tighten our bound to match this scaling. 



6.2 The High-Dimensional Case 

In this subsection, we study the performance of our mod-OMP algorithm for the high-dimensional 
setting, and compare with the projected gradient method in [8 |. We first consider the additive noise 
case and use the following settings: p = 450, n = 400, a e = 0, T, x = I, Y, w = u^I, k G 
{2, . . . , 7}, and a w G [0, 1]. We compare mod-OMP using the Em-estimator and the projected 
gradient method using the corresponding £ and j. Figure pi (a) plots the £2 errors. One observes 
that OMP outperforms the projected gradient method in all cases. 
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2 4 6 8 10 12 14 

Control Parameter 



2 4 6 8 10 12 14 

Control Parameter 



1 2 3 4 5 6 7 

Control Parameter 



(a) (b) (c) 

Figure 5: Comparison of the £2 recovery error of mod-OMP and the projected gradient method 
under knowledge of (a) E w , (b) S x , and (c) an Instrumental Variable. The error is plotted against 
the control parameter (a) (<j w + a^)k, (b) (1 + cr w )k, and (c) a w k. Circles correspond to the 
projected gradient method and dots to mod-OMP. As claimed, mod-OMP performs better in all 
cases considered. Each point is an average over 200 trials. 




Figure 6: Support recovery error for the projected gradient algorithm when Y> w is not precisely 
known. Figure (a) shows the results when we use a w = cr'™ c /2, and (b) shows the results for 
&w = &w' e x 2- We note that our approach (which does not require knowledge of Y. w ) has excellent 
recovery throughout this range. Each point is an average over 20 trials. 



We want to point out that mod-OMP enjoys more favorable running time in our experiments, al- 
though we do not perform a formal comparison since this depends on the particular implementation 
of both methods. As is clear from the description of the algorithm, mod-OMP has exactly the same 
running time as standard OMP. 

We also consider mod-OMP with the and IV-based estimators. Although not discussed in JS), 
it is natural to consider the corresponding variants of the projected gradient method which use the 
£ and 7 from knowledge of T, x or IVs (c.f. (13) in JS)). We plot the recovery errors for our two 
estimators in Figure [5] (b) and (c), and again observe better performance of mod-OMP than the 
projected gradient method. 

We further consider robustness of the projected gradient method to over- or under-estimating a w , 
for support recovery. For very low noise, the performance is unaffected; however, it quickly deteri- 
orates as the noise level grows. The two graphs in Figure[6]show this deterioration; in contrast, our 
estimator has excellent support recovery throughout this range. 
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Figure 7: Comparison of the £ 2 recovery error of mod-OMP and the projected gradient method for 
missing data. The error is plotted against the control parameter k ffz^ ■ ( a ) Independent columns of 
X, and (b) Correlated columns. Dots correspond to mod-OMP and circles to the projected gradient 
method. Our results show that mod-OMP performs better in all cases considered. Each point is an 
average over 50 trials. 



We next study the case with missing data with the following setting: p = 750, n = 500, a e — 
0, Sa; = I, k € {2, . . . , 7}}, and p e [0, 0.5]. The results are displayed in Figure [7] (a), in which 
mod-OMP shows better performance. 

Finally, although we only consider X with independent columns in this paper, we believe that this 
restriction can be removed. For now, we corroborate this claim via simulation. Figure [7] (b) shows 
the results under the following choice of covariance matrix of X: 

= {o.2 i £ ). 

Again, mod-OMP dominates the projected gradient method in terms of empirical performance. 



Appendices 

A Proof of Supporting Concentration Results 



In this section, we provide the proofs to the concentration results for sub-Gaussian random variables 
that we make extensive use of in Section [5] We repeat the statements of the results below for 
convenience. 



Lemma 25 Suppose X £ R nxk , Y € R nxm are zero-mean sub-Gaussian matrices with parame- 
ters ^x)' in^v n a v)- Then for any fixed vector vi,V2, we have 

P (|v7 (Y T X - E [Y T X]) v 2 | > 1 1| Vl || ||v 2 ||) < 3exp (- m mml-^,—X) . 

V I <J x tJ y <J x^y J / 

In particular, if n > logp > log m, log k, we have w.h.p. 

\vj (Y T X - E [Y T X]) v 2 | < G x Oy || vi || ||v 2 || 

Setting vi to be the i th standard basis vector, and using a union bound over i — 1, . . . , m, we have 
w.h.p. 

\\{Y^X-^X))v\\^<a^ v M' IVVA,l> 
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Proof. Rescaling as necessary, we assume cr x = a y = 1 and ||vi|| = || va| 
1. Define $(x) = ||x|| 2 - E (||a;|| 2 ) . Then \vj (Y T X - E [Y T X]) v 2 | 



| |$(Xv 2 +Fvi) - $(Iv 2 ) - $(Fvi)|. Note that Iv 2 + Yw x = [X, Y] [vj, vJ] T , where 
X' = [X,Y] is zero-mean sub-Gaussian with parameter (^E [X' T A"'] , i). Applying (70) in El 
to each of the three terms gives 

|v7 (Y T X -E[Y T X])v 2 \ >t, 
with probability at most cxp ( — cn min {t 2 , t}) . □ 



Corollary 26 Tjf X € R™ xfc is a zero-mean sub-Gaussian matrix with parameter (^cr 2 !, ^cr 2 ), and 
v /s a fixed vector in W 1 , then for any e > 1, we have 



X T v|l > \/ (1 + £)fc ^||v|| 2 I <3exp(-cfce) 



Proof. By assmption, X T is zero-mean sub-Gaussian with parameter (i^er 2 /, r^c 2 ) Note that 



have || JSC 1 v|| 2 < |v 1 (XX 1 - ^crj7)v| + ^cr 2 Uv^. Applying the last lemma with t = ^cr 2 e, 
e > 1 to the first term, we obtain 



v T (XX T --a 2 x I)v 
n 



> — <7 2 e ||v|| 2 J < 3 exp (— ck min {e 2 , e}) = 3 exp (—eke) 



The corollary follows. 



□ 



Lemma 27 If X £ R nxk , Y G M nXm are zero mean sub-Gaussian matrices with parameter 
(^T, x , l afj,(^ v ,-(rl),then 



sup 

vieR™,v2eR fc ,||vi||=||v 2 ||=i 



|v7 (Y T X - E [y T X]) v 2 | > tj < 2 exp ^-cnmin( 

fer 2 <T 2 ~| 

In particular, for each A > 0, ifn > max < , 1 > (k + m) logp, fnen w.h.p. 



) + 6(fc + m) . 



sup 

viGR m ,v 2 eI[ 



\vj (Y T X -E[Y T X])v 2 \ < — A||vi||||v 3 | 



Proof. Rescaling as necessary, we assume cr x = a y = 1. Let _4i, be a 1/3-cover of Vi = {v G 
IMI < 1}; it is known that |.Ai| < 9 2m , and for each v, there is a it(v) G .4i such that 
ll A ( v )ll - ll v -w( v )ll < I- Similarly we can find a 1/3-cover _4 2 of v 2 = {v G R fe , ||v|| < 1} 
with |^2 1 < 9 2fe - Defining $(vi,v 2 ) = (Y T X - E [Y T X]) v 2 , then 

sup |$(vi,v 2 )| < max |$(ui,u 2 )| + sup |$(A(vi), u(v 2 ))| 

+ sup |$(«(vi),A(v a ))|+ sup |*(A(vi),A(v 2 ))|. 

vi£vi,v 2 £v2 viGvi,v 2 ev2 

Becaue 3A(vi), u(vi) G Vi, and 3A(v 2 ), u(v 2 ) G v 2 , it follows that 

sup |$(vi,v 2 )| < max u 2 )\ + (- + - + -) sup |$(vi,v 2 )|, 

vi£vi,V2€v2 Ui,-u 2 E^h yo O & J viGvi,v 2 Gv 2 

hence sup Vl6vi V2ev2 |$(vi,v 2 )| < |max Ul) „ 26> t \$(ux,u 2 )\ ■ Using the last lemma and a union 
bound, we obtain 

P|- max |$(ui,u 2 )| > t] < 9 2fc+2m -exp(-cnmin{i 2 ,i}) < exp (-cn min {i 2 , i} + 6(fc + to)) . 
\2 u 1 ,u 2 eA J 

□ 
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